EPITA 2022 IML lab03_classification_01-fashion_mnist v2022-03-15_180239 by G. Tochon & J. Chazalon
This work is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).
In this session you will practice and get more familiar with the following concepts:
This notebook is the only part of the lab session.
Make sure you read and understand everything, and complete all the required actions.
Required actions are preceded by the following sign:

We will use the FashionMNIST dataset, created by Zalando Research.
Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.
Here's an example of how the data looks in the original repository (each class takes three rows).
Each training and test example is assigned to one of the following classes:
| Label | Description |
|---|---|
| 0 | T-shirt/top |
| 1 | Trouser |
| 2 | Pullover |
| 3 | Dress |
| 4 | Coat |
| 5 | Sandal |
| 6 | Shirt |
| 7 | Sneaker |
| 8 | Bag |
| 9 | Ankle boot |
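For convenience, the table above can be kept as a plain Python list so label indices can be turned into readable names. This is a small helper we introduce here for illustration; it is not part of the dataset files.

```python
# Hypothetical helper: index i gives the human-readable name of class i
# (taken verbatim from the label table above).
CLASS_NAMES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]

def label_name(label):
    """Return the description of a Fashion-MNIST label (0-9)."""
    return CLASS_NAMES[label]
```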
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import sklearn
We first download the dataset, or load it if we already have it...
If you are working on Windows, you will need to adapt those lines, or, alternatively, download the dataset directly from the official repo and put the files under the same directory.
%%bash
mkdir -p tmp_data
for file in train-images-idx3-ubyte.gz train-labels-idx1-ubyte.gz t10k-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz ;
do
test -e tmp_data/${file} || wget -O tmp_data/${file} http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/${file}
done
# check we got some files in the work directory
!ls tmp_data/
t10k-images-idx3-ubyte.gz train-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz train-labels-idx1-ubyte.gz
...then we open it...
def load_data(path, kind='train'):
    """
    Load data from `path`, using subset `kind`.

    Parameters
    ----------
    path: str
        Path to directory where dataset files are stored.
    kind: str (either "train" or "t10k")
        Selects the subset to use: `"train"` for train set, `"t10k"` for test set.

    Returns
    -------
    images, labels: Dataset subset content.
        images: np.array of shape (n_observations, n_features), dtype np.uint8
            Image data
        labels: np.array of shape (n_observations, ), dtype np.uint8
            Labels (0-9) for each observation.
    """
    import os
    import gzip
    if kind not in ("train", "t10k"):
        raise ValueError("kind must be either 'train' or 't10k'.")
    labels_path = os.path.join(path, f"{kind}-labels-idx1-ubyte.gz")
    images_path = os.path.join(path, f"{kind}-images-idx3-ubyte.gz")
    with gzip.open(labels_path, 'rb') as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8, offset=8)
    with gzip.open(images_path, 'rb') as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8, offset=16).reshape(len(labels), 784)
    return images, labels
# Read the train set
train_img, train_labels = load_data("tmp_data", "train")
train_img.shape, train_img.dtype, train_labels.shape, train_labels.dtype
((60000, 784), dtype('uint8'), (60000,), dtype('uint8'))
# Read the test set
test_img, test_labels = load_data("tmp_data", "t10k")
test_img.shape, test_img.dtype, test_labels.shape, test_labels.dtype
((10000, 784), dtype('uint8'), (10000,), dtype('uint8'))
...and we plot a random selection of images.
print("The size of the images is (side of the square):")
print(np.sqrt(train_img.shape[1]))
The size of the images is (side of the square): 28.0
plt.figure(figsize=(15,15))
for ii in range(100):
    image = train_img[ii]
    label = train_labels[ii]
    plt.subplot(10, 10, ii+1)
    plt.imshow(image.reshape((28,28)), cmap='gray')
    plt.title(f"class: {label}")
    plt.axis("off")
plt.tight_layout()
plt.show()
**Question:** Is the train set **balanced**?
# Is it balanced? Count the number of elements per class.
np.unique(train_labels, return_counts=True)
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint8), array([6000, 6000, 6000, 6000, 6000, 6000, 6000, 6000, 6000, 6000]))
TODO answer
Teacher:
Yes it is balanced (same number of elements for each class).
**Question:** Is the train set **sorted**?
# Is it sorted (quick check)?
train_labels
array([9, 0, 0, ..., 3, 0, 5], dtype=uint8)
TODO answer
TEACHER:
No, it is not sorted (by class). We must be careful when sampling data.
**Question:** Does the train set look noisy (at first sight)?
*Hint:* Display the first images for each class.
# TODO code
TODO answer
# teacher
for cls in range(10):
    plt.figure(figsize=(14,7))
    plt.tight_layout()
    selection = train_labels == cls
    for ii in range(50):
        plt.subplot(5, 10, ii+1)
        plt.imshow(train_img[selection][ii].reshape((28,28)), cmap='gray')
        plt.axis("off")
    fig = plt.gcf()
    fig.suptitle(f"Samples from class {cls}", fontsize=14)
    plt.show()
TEACHER:
Some classes may be hard to discriminate, but intrinsically they look quite consistent (at first sight). Some samples are hard to identify though.
In the first exercise we will leverage scikit-learn's great classifiers to quickly get a baseline.
To avoid premature complexity, we will first focus on the case of a binary classifier, i.e. a case where data can only be classified into two classes. We usually label those classes 0 and 1.
Train a classifier to discriminate images from class 0 ("T-shirt/top") and class 2 ("Pullover").
*Hints:*
- Use the `select_2_classes()` function below to generate a train and a test set.
- Use a `LinearSVC` classifier with default parameters (but custom seeding for reproducibility), unless you have some particular classifier you want to try. Note that most of the questions assume you will use the `LinearSVC` classifier.
def select_2_classes(X, y, label_a, label_b):
    """
    Transforms our dataset to select only samples and labels which have either `label_a` or `label_b`.

    Parameters
    ----------
    X: np.array of shape (n_samples, n_features)
        Array of samples
    y: np.array of shape (n_samples, ); dtype: np.uint8
        Array of labels for each sample (integer format)
    label_a: integer
        Value of the first label to keep
    label_b: integer
        Value of the other label to keep

    Returns
    -------
    new_X, new_y: same type and shape as `X` and `y`
        Selection of samples and labels whose labels are either `label_a` or `label_b`.
    """
    selection_a_b = (y == label_a) | (y == label_b)
    new_X = X[selection_a_b]
    new_y = y[selection_a_b]
    new_y = (new_y == label_a).astype(y.dtype)  # could be removed for most of this lab
    return new_X, new_y
# Creation of the train set
x_train, y_train = select_2_classes(train_img, train_labels, 0, 2)
x_train.shape, x_train.dtype, y_train.shape, y_train.dtype
((12000, 784), dtype('uint8'), (12000,), dtype('uint8'))
# TODO create the test set the same way
x_test, y_test = ???
x_test, y_test = select_2_classes(test_img, test_labels, 0, 2)
x_test.shape, x_test.dtype, y_test.shape, y_test.dtype
((2000, 784), dtype('uint8'), (2000,), dtype('uint8'))
Create the classifier
# TODO create the classifier here and train it
from sklearn.svm import LinearSVC
clf = LinearSVC(random_state=0, max_iter=5000)
%%time
clf.fit(x_train, y_train)
CPU times: user 18 s, sys: 46.2 ms, total: 18 s Wall time: 18 s
/home/jchazalo/.virtualenvs/iml_py3.8/lib/python3.8/site-packages/sklearn/svm/_base.py:985: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
warnings.warn("Liblinear failed to converge, increase "
LinearSVC(max_iter=5000, random_state=0)
Now that your classifier is trained, you should first check the results visually for a qualitative control.
**Using the `plot_some_results` and `plot_some_errors` functions provided below, display some predictions from the test set and control that they make sense.**
def plot_some_results(x_data, y_true, y_pred, /, num_elements=100):
    """
    Plot the `num_elements` first results in a fancy way.

    Parameters
    ----------
    x_data: np.array of shape (n_observations, n_features), dtype np.uint8
        Image data
    y_true: np.array of shape (n_observations, ), dtype np.uint8
        Expected (true) labels.
    y_pred: np.array of shape (n_observations, ), dtype np.uint8
        Predicted labels.
    num_elements: int, default=100
        Number of elements to plot.

    Returns
    -------
    None
    """
    num_cols = 10
    num_rows = (num_elements + num_cols - 1) // num_cols
    plt.figure(figsize=(15,15))
    for ii in range(min(x_data.shape[0], num_elements)):
        image = x_data[ii]
        label_true = y_true[ii]
        label_pred = y_pred[ii]
        plt.subplot(num_rows, num_cols, ii+1)
        plt.imshow(image.reshape((28,28)), cmap='gray')
        if label_pred == label_true:
            plt.title(f"OK: {label_true}")
            frame_color = 'g'
        else:
            plt.title(f"ERR: E{label_true} -> P{label_pred}")
            frame_color = 'r'
        h, w = 28, 28
        plt.plot([0, 0, w, w, 0], [0, h, h, 0, 0], frame_color, linewidth=2)
        plt.axis("off")
    plt.tight_layout()
    plt.show()
def plot_some_errors(x_data, y_true, y_pred, /, num_elements=100):
    """
    Plot the `num_elements` first errors in a fancy way.

    Parameters
    ----------
    x_data: np.array of shape (n_observations, n_features), dtype np.uint8
        Image data
    y_true: np.array of shape (n_observations, ), dtype np.uint8
        Expected (true) labels.
    y_pred: np.array of shape (n_observations, ), dtype np.uint8
        Predicted labels.
    num_elements: int, default=100
        Number of elements to plot.

    Returns
    -------
    None
    """
    selection_errors = y_pred != y_true
    plot_some_results(x_data[selection_errors], y_true[selection_errors], y_pred[selection_errors],
                      num_elements=num_elements)
# TODO
# ...
# plot_some_results(...)
plot_some_results(x_test, y_test, clf.predict(x_test))
# TODO
# ...
# plot_some_errors(...)
plot_some_errors(x_test, y_test, clf.predict(x_test))
Do the errors make sense?
Can you say for sure whether it is a T-shirt/top or a Pullover in each case?
# You can save your classifier for future use if you want to.
# For more details, see:
# https://scikit-learn.org/stable/modules/model_persistence.html#model-persistence
# from joblib import dump, load
# dump(clf, 'tmp_clf/clf_type_param_moreinfo.joblib')
We are now going to evaluate the performance of the classifier using the integrated score method.
clf.score(x_test, y_test)
0.896
The `score` method reports the **accuracy**.
In the next section we will provide a more precise definition.
Implement the function `my_accuracy` below to get the same result.
def my_accuracy(y_true, y_pred):
    """
    Computes the accuracy of some classification results.

    Parameters
    ----------
    y_true: np.ndarray of shape (n_samples, )
        True expected labels.
    y_pred: np.ndarray of shape (n_samples, )
        Predicted labels.

    Returns
    -------
    accuracy: float
        Accuracy of the classification results, i.e. the number of correctly classified elements (correct label)
        divided by the total number of predictions.
    """
    return ????
def my_accuracy(y_true, y_pred):
    """
    Computes the accuracy of some classification results.

    Parameters
    ----------
    y_true: np.ndarray of shape (n_samples, )
        True expected labels.
    y_pred: np.ndarray of shape (n_samples, )
        Predicted labels.

    Returns
    -------
    accuracy: float
        Accuracy of the classification results, i.e. the number of correctly classified elements (correct label)
        divided by the total number of predictions.
    """
    return np.sum(y_true == y_pred) / y_true.shape[0]
We should get the same result as before here:
my_accuracy(y_test, clf.predict(x_test))
0.896
The accuracy is a very basic indicator which weights all errors equally. However, the cost of misclassifying A as B is not always the same as the cost of misclassifying B as A (think of a fraud detection system, for example).
The confusion matrix is the key to understanding the core indicators: accuracy, precision, recall, etc.
For a binary classifier, labels are either 1 (True) or 0 (False), and the confusion matrix is composed of only 4 elements:

- **True positives (TP)**: the expected value is True and the predicted value is True as well;
- **True negatives (TN)**: the expected value is False and the predicted value is False as well;
- **False negatives (FN)**: samples which are True but were predicted as False;
- **False positives (FP)**: samples which are False but were predicted as True.

Let us look at the confusion matrix of your classifier…
# You may need to adapt this line to match your variable names
sklearn.metrics.plot_confusion_matrix(clf, x_test, y_test)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f9884fe67c0>
What are the counts of true positives, true negatives, false positives and false negatives produced by your classifier?
TODO answer
The matrix is arranged in this way:

| | Predicted True | Predicted False |
|---|---|---|
| **Expected True** | TP | FN |
| **Expected False** | FP | TN |
Based on these terms, the accuracy has the following definition:
\begin{equation} \large \mathrm{Accuracy} = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}} \end{equation}

And precision and recall are:

\begin{equation} \large \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} \end{equation}

\begin{equation} \large \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} \end{equation}
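As a quick sanity check of these formulas, here is a tiny worked example on hand-made toy labels (made-up arrays, not our dataset):

```python
import numpy as np

# Toy ground truth and predictions (True = positive class).
y_true = np.array([True, True, True, False, False, False])
y_pred = np.array([True, True, False, True, False, False])

tp = np.sum(y_true & y_pred)    # samples 0 and 1 -> 2
fp = np.sum(~y_true & y_pred)   # sample 3 -> 1
fn = np.sum(y_true & ~y_pred)   # sample 2 -> 1
tn = np.sum(~y_true & ~y_pred)  # samples 4 and 5 -> 2

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 4/6
precision = tp / (tp + fp)                  # 2/3
recall = tp / (tp + fn)                     # 2/3
```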
Compute the precision and recall of your previous classification results.
def my_binary_precision(y_true, y_pred):
    """
    Computes the precision of some binary classification results.

    Parameters
    ----------
    y_true: np.ndarray of shape (n_samples, ) and dtype bool
        True expected labels.
    y_pred: np.ndarray of shape (n_samples, ) and dtype bool
        Predicted labels.

    Returns
    -------
    precision: float
        Precision of the binary classification results.
    """
    tp = ???
    fp = np.sum(~y_true & y_pred)
    return ???
def my_binary_recall(y_true, y_pred):
    """
    Computes the recall of some binary classification results.

    Parameters
    ----------
    y_true: np.ndarray of shape (n_samples, ) and dtype bool
        True expected labels.
    y_pred: np.ndarray of shape (n_samples, ) and dtype bool
        Predicted labels.

    Returns
    -------
    recall: float
        Recall of the binary classification results.
    """
    tp = ???
    fn = ???
    return ???
def my_binary_precision(y_true, y_pred):
    """
    Computes the precision of some binary classification results.

    Parameters
    ----------
    y_true: np.ndarray of shape (n_samples, ) and dtype bool
        True expected labels.
    y_pred: np.ndarray of shape (n_samples, ) and dtype bool
        Predicted labels.

    Returns
    -------
    precision: float
        Precision of the binary classification results.
    """
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    return tp / (tp + fp)
def my_binary_recall(y_true, y_pred):
    """
    Computes the recall of some binary classification results.

    Parameters
    ----------
    y_true: np.ndarray of shape (n_samples, ) and dtype bool
        True expected labels.
    y_pred: np.ndarray of shape (n_samples, ) and dtype bool
        Predicted labels.

    Returns
    -------
    recall: float
        Recall of the binary classification results.
    """
    tp = np.sum(y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    return tp / (tp + fn)
# TODO compute precision and recall for the predictions of our classifier on the test set
# ...
my_binary_precision(y_test, clf.predict(x_test))
0.8361629881154499
my_binary_recall(y_test, clf.predict(x_test))
0.985
The F-score, finally, is the harmonic mean of precision and recall:

\begin{equation} \large F_1 = \frac{2}{\mathrm{recall^{-1}} + \mathrm{precision^{-1}}} = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} = \frac{\mathrm{TP}}{\mathrm{TP} + \frac12 (\mathrm{FP} + \mathrm{FN})} \end{equation}

For most classifiers, it is possible to rank their predictions on the test set by probability, confidence, or some other form of score.
With scikit-learn, classifiers which can provide such information implement either:
- a `predict_proba()` method, which assigns a probability of being True to each sample;
- a `decision_function()` method, which predicts confidence scores for samples.

In the case of `LinearSVC`, `decision_function()` returns the confidence score associated with each sample.
The confidence score for a sample is proportional to the signed distance of that sample to the hyperplane.
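To make the link between the score and the hyperplane concrete, here is a pure-NumPy sketch with a hand-picked weight vector and intercept (made-up values, not a trained model): the score of a sample is $w \cdot x + b$, and the signed distance to the hyperplane is that score divided by $\lVert w \rVert$.

```python
import numpy as np

# Made-up linear model: w and b are chosen by hand for illustration only.
w = np.array([3.0, 4.0])   # normal vector of the separating hyperplane
b = -5.0                   # intercept

x = np.array([3.0, 4.0])   # one 2-D sample

score = w @ x + b                     # what decision_function() would return: 20.0
distance = score / np.linalg.norm(w)  # signed distance to the hyperplane: 20/5 = 4.0
prediction = score >= 0               # the default decision threshold is 0
```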
Using either of these methods, it is possible to plot, for each possible threshold, the numbers of TP, FP, FN, TN… or any other indicator derived from them.
Here is (below) the distribution of confidence score produced by the decision function (assuming you used a LinearSVC) **for the train set**.
**Where are the majority of the values?**
confidence_scores = clf.decision_function(x_train)
confidence_scores.shape, confidence_scores.dtype
((12000,), dtype('float64'))
plt.hist(confidence_scores, bins=100)
plt.title("Distribution of confidence scores")
Text(0.5, 1.0, 'Distribution of confidence scores')
TODO some thoughts about this distribution…
It shows that the majority of values are in $[-10, 10]$ and pretty much centered around zero.
What is more interesting is to plot the distribution of such values according to the class the samples belong to.
plt.figure(figsize=(8,4))
plt.hist(confidence_scores[y_train==0], bins=50, color='b', label="Class 0", alpha=0.6)
plt.hist(confidence_scores[y_train==1], bins=50, color='g', label="Class 1", alpha=0.6)
plt.legend()
plt.title("Distribution of confidence scores for each class")
Text(0.5, 1.0, 'Distribution of confidence scores for each class')
Based on the previous plot, and assuming each error (FP and FN) has the same cost, what is roughly the optimal decision threshold (on the x-axis) which will minimize the total error?
TODO answer
In the previous plot, the intersection between the two distributions is the set of samples which will be wrongly classified in the training set.
The optimal threshold, if all errors have the same cost, will be at the intersection of the two distributions.
sklearn.metrics.precision_recall_curve provides a very nice way to compute all the values that precision and recall would take by selecting each of the possible thresholds.
Here we will look at the values from the training set (because we will try to calibrate the decision function in the next section).
precision, recall, thresholds = sklearn.metrics.precision_recall_curve(y_train, clf.decision_function(x_train))
We provide you with some extra visualization function which can help.
def plot_precision_recall_vs_threshold_vs_f1(precisions, recalls, thresholds):
    f1 = 2 * precisions * recalls / (precisions + recalls)
    plt.plot(thresholds, f1[:-1], 'r:', label="$F_1$ score")
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
    amax_f1 = np.argmax(f1)
    thr_best_f1 = thresholds[amax_f1]
    plt.plot([thr_best_f1, thr_best_f1], [0, f1[amax_f1]], "k:", linewidth=2)
    # Note: use the `precisions`/`recalls` parameters here, not any global variable.
    plt.xlabel(f"Thresholds (best for $F_1$ @ $T$ = {thr_best_f1:0.2f}: "
               f"$F_1$ = {f1[amax_f1]:0.2f}; $P$ = {precisions[amax_f1]:0.2f}; $R$ = {recalls[amax_f1]:0.2f})",
               fontsize=14)
    plt.legend(fontsize=14)  # loc="upper left", fontsize=16
    plt.grid()
    plt.ylim([0, 1])
    plt.xlim([np.min(thresholds), np.max(thresholds)])
plt.figure(figsize=(16, 8))
plot_precision_recall_vs_threshold_vs_f1(precision, recall, thresholds)
By default, the `predict()` method will assume that values which are $\geq 0$ are `True` and values $\lt 0$ are `False`.
**Is $0$ the best possible threshold here to maximize the accuracy? To maximize $F_1$?**
TODO what about the calibration?
The calibration is bad. We can fix it in several ways, as we will see in the next two sections: post-correction and feature pre-processing.
Before looking at the calibration of the predictor in more detail, let us discover the PR and ROC curves.
Using precision and recall values computed for each threshold, it is possible to plot precision(t) vs recall(t) for each t (threshold).
This gives an idea of the different operation modes our system could have based on the threshold we choose:
sklearn.metrics.plot_precision_recall_curve(clf, x_test, y_test)
<sklearn.metrics._plot.precision_recall_curve.PrecisionRecallDisplay at 0x7f986028c2b0>
The ROC curve, finally, plots the true positive rate (aka sensitivity, recall, hit rate) vs the false positive rate (aka fall-out, 1 minus specificity, etc.).

\begin{equation} \large \mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{P}} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} = 1 - \mathrm{FNR} = \text{Recall} \end{equation}

\begin{equation} \large \mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{N}} = \frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}} = 1 - \mathrm{TNR} \end{equation}

It illustrates a kind of signal vs noise compromise: how many false positives (e.g. background noise) you will have to accept to increase the recall (e.g. voice signal over a radio transmitter).
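These two rates can be checked on toy confusion-matrix counts (made-up numbers, for illustration only):

```python
# Toy confusion-matrix counts (made-up numbers).
tp, fn = 80, 20   # positives: P = tp + fn = 100
fp, tn = 10, 90   # negatives: N = fp + tn = 100

tpr = tp / (tp + fn)  # true positive rate (recall): 80/100 = 0.8
fpr = fp / (fp + tn)  # false positive rate: 10/100 = 0.1
```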
sklearn.metrics.plot_roc_curve(clf, x_test, y_test)
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x7f98867b3760>
Calibration should be done on a validation set, but we can illustrate the process here on the train set directly. As long as we do not calibrate on the test set, this cannot be very wrong…
Calibration (here) is about finding the threshold which maximizes some metric.
Let us first compute the uncalibrated predictions, and compute their accuracy on the test set.
uncalibrated_predictions = clf.predict(x_test)
uncalibrated_predictions[:10]
array([0, 0, 1, 1, 1, 1, 0, 0, 0, 1], dtype=uint8)
my_accuracy(y_test, uncalibrated_predictions)
0.896
sklearn.metrics.confusion_matrix(y_test, uncalibrated_predictions)
array([[807, 193],
[ 15, 985]])
Now, let us compute a new threshold on the train set which will maximize the accuracy, and evaluate the quality of the new decision on the test set.
Complete the code of the function `find_best_threshold_for_accuracy` below to compute the new decision with a new threshold, and analyse the results on the test set.
from sklearn.metrics import accuracy_score

def find_best_threshold_for_accuracy(y_true, uncalibrated_predictions):
    """
    Find the best threshold which maximizes accuracy for a given set of predictions.

    Parameters
    ----------
    y_true: np.array of shape (n_samples, )
        True labels.
    uncalibrated_predictions: np.array of shape (n_samples, ) and dtype float
        Scores or probabilities assigned to each sample.

    Returns
    -------
    best_thresh, best_acc: float, float
        Best threshold and best accuracy obtained.
    """
    thresholds = uncalibrated_predictions
    best_acc = 0.
    best_thresh = None
    for t in thresholds:
        pred_t = ?????  # <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< FIXME
        acc_t = accuracy_score(y_true, pred_t)
        if best_thresh is None or best_acc < acc_t:
            best_acc = acc_t
            best_thresh = t
    return best_thresh, best_acc
def find_best_threshold_for_accuracy(y_true, uncalibrated_predictions):
    """
    Find the best threshold which maximizes accuracy for a given set of predictions.

    Parameters
    ----------
    y_true: np.array of shape (n_samples, )
        True labels.
    uncalibrated_predictions: np.array of shape (n_samples, ) and dtype float
        Scores or probabilities assigned to each sample.

    Returns
    -------
    best_thresh, best_acc: float, float
        Best threshold and best accuracy obtained.
    """
    thresholds = uncalibrated_predictions
    best_acc = 0.
    best_thresh = None
    for t in thresholds:
        pred_t = uncalibrated_predictions >= t
        acc_t = accuracy_score(y_true, pred_t)
        if best_thresh is None or best_acc < acc_t:
            best_acc = acc_t
            best_thresh = t
    return best_thresh, best_acc
best_thresh, best_acc = find_best_threshold_for_accuracy(y_train, clf.decision_function(x_train))
best_thresh, best_acc
(1.2606260849604012, 0.9620833333333333)
# Another option -- I like it much better because the intersection regularizes the optimisation
# Intersection of Precision and Recall curves
intersection = np.searchsorted(precision > recall, True)
best_thresh = thresholds[intersection]
best_thresh
1.6794393038629978
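The `np.searchsorted` trick above relies on the boolean array `precision > recall` being (approximately) monotone: `False` before the crossing point, `True` after. On such an array, searching for `True` returns the index of the first `True`, i.e. the crossing. A tiny illustration with synthetic values (not our actual curves):

```python
import numpy as np

# Synthetic monotone curves crossing between index 1 and index 2.
precision = np.array([0.2, 0.4, 0.6, 0.8])
recall    = np.array([0.9, 0.7, 0.5, 0.3])

# precision > recall is [False, False, True, True]; searchsorted finds
# the insertion point of True, i.e. the index of the first True: 2.
crossing = np.searchsorted(precision > recall, True)
```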
# Compute the new decisions on test set using the best threshold
calibrated_predictions = clf.decision_function(x_test) > best_thresh
calibrated_predictions[:10]
array([False, False, True, False, True, True, False, False, False,
False])
my_accuracy(y_test, calibrated_predictions)
0.939
sklearn.metrics.confusion_matrix(y_test, calibrated_predictions)
array([[920, 80],
[ 42, 958]])
Is the accuracy on the test set better?
TODO Answer
Quite an improvement indeed… And calibration is even more important for deep networks!
But we may be doing things the wrong way: we know our features are not normalized!
Let us try some normalization, and learn how to use pipelines.
Pipelines are an elegant way to combine preprocessors (which have `fit()` and `transform()` methods) with predictors (which have `fit()` and `predict()` methods) into a single object.
We encourage you to use this integrated way of combining pre-processing and classification because it prevents you from forgetting to apply the same pre-processing to test data; a very common mistake!
To create pipelines, we can either use the Pipeline constructor, or the factory function make_pipeline.
Their signature is slightly different.
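As a sketch of the difference (assuming a standard scikit-learn install): `Pipeline` takes a list of `(name, estimator)` pairs, while `make_pipeline` derives the step names automatically from the lowercased class names.

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Explicit step names with the Pipeline constructor...
pipe_a = Pipeline([("scaler", StandardScaler()),
                   ("clf", LinearSVC(random_state=0))])

# ...or auto-generated names with the make_pipeline factory function.
pipe_b = make_pipeline(StandardScaler(), LinearSVC(random_state=0))
# pipe_b step names are "standardscaler" and "linearsvc"
```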
Create a pipeline which combines a `StandardScaler` with the same `LinearSVC` as we used before;
then train it on our binary case and evaluate its performance on our test set.
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline, Pipeline
pipeline = make_pipeline(????) # FIXME
pipeline = make_pipeline(StandardScaler(), LinearSVC(random_state=0, max_iter=5000))
%%time
pipeline.fit(x_train, y_train)
CPU times: user 21.8 s, sys: 945 ms, total: 22.8 s Wall time: 22.8 s
/home/jchazalo/.virtualenvs/iml_py3.8/lib/python3.8/site-packages/sklearn/svm/_base.py:985: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
warnings.warn("Liblinear failed to converge, increase "
Pipeline(steps=[('standardscaler', StandardScaler()),
('linearsvc', LinearSVC(max_iter=5000, random_state=0))])
pipeline.score(x_test, y_test)
0.9455
precision, recall, thresholds = sklearn.metrics.precision_recall_curve(y_train, pipeline.decision_function(x_train))
plt.figure(figsize=(8, 4))
plot_precision_recall_vs_threshold_vs_f1(precision, recall, thresholds)
**How does the performance compare to our previous version?**
**Is the new predictor calibrated differently?**
TODO Answer
TEACHER
What are the steps run by sklearn **during the training**, in terms of calls to `fit()`, `transform()` and `predict()`, for the preprocessor and for the classifier?
What are the steps run by sklearn **during the testing/prediction**, in terms of calls to `fit()`, `transform()` and `predict()`, for the preprocessor and for the classifier?
TODO answer
Training process:
Prediction process:
Here we want to get the best possible model without looking at the test set.
We will split the train set into several parts, training on all but one part and evaluating on the left one, then repeating the operation with another part left out.
We will keep the best performing model trained with this process.
Using [`sklearn.model_selection.StratifiedKFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold) to separate the train set into train and **validation** subsets automatically, train several models and keep the best one based on its accuracy **on the validation set**.
Finally, evaluate the performance of the best model **on the test set**.
*Hint*: look at the example in the documentation of [`sklearn.model_selection.StratifiedKFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold).
from sklearn.model_selection import StratifiedKFold
# TODO code
skf = StratifiedKFold(n_splits=10)  # we leave 10% of the dataset for validation
best_model = None
best_score = 0.0
for ii, (train, val) in enumerate(skf.split(x_train, y_train)):
    print(f"Processing split {ii}...")
    xt = x_train[train]
    yt = y_train[train]
    clf = make_pipeline(StandardScaler(), LinearSVC(random_state=0, max_iter=5000))
    print("\t training started")
    clf.fit(xt, yt)
    print("\t training complete")
    xv = x_train[val]
    yv = y_train[val]
    score = clf.score(xv, yv)
    print(f"\tScore: {score:0.3f}")
    if best_model is None or score > best_score:
        print(f"\tNew best model: score {best_score:0.3f} --> {score:0.3f}")
        best_model = clf
        best_score = score
    print("")
print("KFold complete.")
print(f"Best validation score obtained: {best_score:0.3f}")

# Now compute the final score on the unseen test data
final_score = best_model.score(x_test, y_test)
print(f"Final test score: {final_score:0.3f}")
Is our new model performing better?
What would be good usages of cross validation?
Do we really need a stratified splitter here?
*Some recommended readings:*
- API: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection
- Cross validation: https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation
- Hyper-parameter tuning: https://scikit-learn.org/stable/modules/grid_search.html#grid-search
- More about estimator's variance: https://scikit-learn.org/stable/modules/learning_curve.html#learning-curve
TODO answer
TEACHER
Compared to training without cross-validation, this is not a tremendous gain here, but it can help when models have a lot of variance.
This is also a way to generate several models which can be ensembled.
Let us now try to optimize the meta-parameters of our predictor.
We will continue to work on 2 classes for now because it may be slow on all classes.
Complete the code below to recreate the pipeline you used previously and set the grid parameters so you explore different values for the `C` parameter of our `LinearSVC`.
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
# Create the pipeline object
pipeline = Pipeline([
    ("scaler", ???),     # <<<<<<< FIXME
    ("linearsvc", ???),  # <<<<<<< FIXME
])

# Create new parameter dictionary
grid_params = {
    # Key = step name from pipeline + __ + hyperparameter, value = tuple of possible values
    'linearsvc__random_state': (0, 3),  # <<<<<<< REPLACEME
}

# Instantiate new gridsearch object
gs = GridSearchCV(pipeline, grid_params, n_jobs=6, verbose=4)

# Fit model to our training data
gs.fit(x_train, y_train)

# Score the model on our testing data
gs.score(x_test, y_test)
# Create the pipeline object
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("linearsvc", LinearSVC(random_state=0, max_iter=5000)),
])

# Create new parameter dictionary
grid_params = {
    # Key = step name from pipeline + __ + hyperparameter, value = tuple of possible values
    'linearsvc__C': (0.01, 0.1, 1, 10),
}

# Instantiate new gridsearch object
gs = GridSearchCV(pipeline, grid_params, n_jobs=6, verbose=4)

# Fit model to our training data
gs.fit(x_train, y_train)

# Score the model on our testing data
gs.score(x_test, y_test)
Pretty printing of the parameters
import pandas as pd
df = pd.DataFrame(gs.cv_results_)
df
Did we end up using a value for `C` different from the default one?
TODO Answer
TEACHER
Yes, it seems that a different value is better adapted to our data.
In this section, we will train a multi-class classifier, and we will have a quick look at the multi-class classification strategies: OvO (one versus one) and OvR (one versus rest, aka one versus all or OvA).
How many classifiers do we need to train for the 10 classes for our dataset with:
- OvO strategy?
- OvA strategy?
TODO answer
TEACHER
OvA: $10$
OvO: $(10*10 - 10) / 2 = 45$ (all combinations minus same class divided by 2 because of symmetry)
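These counts follow directly from the combinatorics (n classifiers for OvR/OvA, one per unordered pair for OvO); a one-line check:

```python
def n_ovr(n_classes):
    """One-vs-rest: one binary classifier per class."""
    return n_classes

def n_ovo(n_classes):
    """One-vs-one: one binary classifier per unordered pair of classes."""
    return n_classes * (n_classes - 1) // 2

# For the 10 Fashion-MNIST classes: 10 OvR classifiers, 45 OvO classifiers.
```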
Using a `SGDClassifier` (which will have less trouble with all this data), train a prediction system on the complete training set, and evaluate its performance on the test set. Do not forget to display some results and errors to make sure they make sense, and plot the confusion matrix.
Make sure your system's performance is way above the expected performance a random system would have!
from sklearn.linear_model import SGDClassifier
# TODO code
clf_full = make_pipeline(StandardScaler(),
                         SGDClassifier(random_state=0,
                                       shuffle=False,
                                       early_stopping=True,
                                       n_jobs=6))
%%time
clf_full.fit(train_img, train_labels)
CPU times: user 25.5 s, sys: 31 s, total: 56.6 s Wall time: 15.7 s
Pipeline(steps=[('standardscaler', StandardScaler()),
('sgdclassifier',
SGDClassifier(early_stopping=True, n_jobs=6, random_state=0,
shuffle=False))])
sklearn.metrics.plot_confusion_matrix(clf_full, test_img, test_labels)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f9860935610>
clf_full.score(test_img, test_labels)
0.8289
test_predictions = clf_full.predict(test_img)
plot_some_results(test_img, test_labels, test_predictions)
plot_some_errors(test_img, test_labels, test_predictions)
TEACHER
A random system would get a 10% accuracy on this balanced dataset; so 82% is much better than random guess.
or "Fashion MNIST is like MNIST: it is too easy."
A quick experiment to show that some classes of this problem are really easy to discriminate.
Prepare train set
x_train, y_train = select_2_classes(train_img, train_labels, 1, 9)
x_train.shape, x_train.dtype, y_train.shape, y_train.dtype
((12000, 784), dtype('uint8'), (12000,), dtype('uint8'))
clf = LinearSVC(random_state=0)
Train classifier
%%time
clf.fit(x_train, y_train)
CPU times: user 133 ms, sys: 20.5 ms, total: 154 ms Wall time: 152 ms
LinearSVC(random_state=0)
Create test set
x_test, y_test = select_2_classes(test_img, test_labels, 1, 9)
x_test.shape, x_test.dtype, y_test.shape, y_test.dtype
((2000, 784), dtype('uint8'), (2000,), dtype('uint8'))
clf.score(x_test, y_test)
1.0
Those classes seem pretty easy to discriminate…
y_test_pred = clf.predict(x_test)
y_test_pred.shape, y_test_pred.dtype
((2000,), dtype('uint8'))
plot_some_results(x_test, y_test, y_test_pred)
plot_some_errors(x_test, y_test, y_test_pred)
<Figure size 1080x1080 with 0 Axes>
No error on test set!
Now you can try various classifiers, with appropriate evaluation. Some suggestions: k nearest neighbor, linear regression, SVM with non-linear kernel (RBF typically), random forest…
# code code code!
Using cross-validation try to find the best performing classifier with the best parameters from scikit-learn to solve the full problem (all classes).
Be ready for a night of computation. Do not burn your laptop; use a desktop computer or server!
# code code code!